Use SoftFileLock to ensure model resources are processed correctly #280

Merged
darkrush merged 16 commits into ccprocessor:dev from darkrush:filelock on Mar 10, 2025

@darkrush darkrush commented Mar 7, 2025

Using SoftFileLock on NFS lets multiple processes on multiple nodes safely handle the same data resources (download and decompress).
The commonly used resource management interfaces are now exposed together in llm_web_kit/model/resource_utils/__init__.py, so callers no longer need to import them one by one from each file.
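
For readers unfamiliar with the pattern, the following is a minimal sketch of existence-based ("soft") locking around a download-or-unzip step. It is not the code from this PR: the PR uses `SoftFileLock` from the `filelock` package, while this sketch is a standard-library stand-in for the same idea, and the names `soft_lock` and `process_once` are hypothetical. It assumes the file system honors exclusive file creation.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def soft_lock(lock_path, timeout=60.0, poll=0.1):
    """Hypothetical helper: whoever manages to create lock_path holds the lock."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)  # release: other waiters can now create the file


def process_once(target_path, build):
    """Hypothetical helper: run build(tmp_path) at most once per target."""
    if os.path.exists(target_path):  # fast path: resource already processed
        return target_path
    with soft_lock(target_path + ".lock"):
        # Re-check after acquiring the lock: another process may have finished.
        if not os.path.exists(target_path):
            tmp_path = target_path + ".tmp"
            build(tmp_path)                    # e.g. download or unzip
            os.replace(tmp_path, target_path)  # publish the finished result
    return target_path
```

A caller would pass a function that writes the resource to the given temporary path, for example `process_once("/cache/model.bin", download_fn)`. One caveat of any existence-based lock is that a process that dies while holding it leaves the lock file behind, so a timeout or cleanup policy is needed.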

Tested on Spark4 with 1000 partitions:

  • update_language_by_str: a run with an empty cache folder (resources have to be downloaded first) reached 546*1000 QPS in total (peak parallelism about 350); a rerun with resources already processed reached 725*1000 QPS.
  • update_political_by_str: a run with an empty cache folder (resources have to be downloaded and unzipped first) reached 222*1000 QPS in total (peak parallelism about 100); a rerun with resources already processed reached 261*1000 QPS.

codecov bot commented Mar 7, 2025

Codecov Report

Attention: Patch coverage is 91.93548% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ..._web_kit/model/resource_utils/process_with_lock.py | 91.11% | 4 Missing ⚠️ |
| llm_web_kit/model/code_detector.py | 40.00% | 3 Missing ⚠️ |
| llm_web_kit/model/policical.py | 77.77% | 2 Missing ⚠️ |
| ...lm_web_kit/model/resource_utils/download_assets.py | 96.55% | 2 Missing ⚠️ |
| llm_web_kit/model/resource_utils/unzip_ext.py | 93.54% | 2 Missing ⚠️ |
| llm_web_kit/model/porn_detector.py | 88.88% | 1 Missing ⚠️ |
| llm_web_kit/model/resource_utils/utils.py | 95.83% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##              dev     #280      +/-   ##
==========================================
- Coverage   90.68%   89.92%   -0.76%     
==========================================
  Files         116       85      -31     
  Lines        6914     5630    -1284     
==========================================
- Hits         6270     5063    -1207     
+ Misses        644      567      -77     

| Files with missing lines | Coverage Δ |
|---|---|
| llm_web_kit/model/html_layout_cls.py | 90.90% <100.00%> (-0.21%) ⬇️ |
| llm_web_kit/model/lang_id.py | 65.38% <100.00%> (+0.76%) ⬆️ |
| llm_web_kit/model/quality_model.py | 84.65% <100.00%> (ø) |
| llm_web_kit/model/unsafe_words_detector.py | 94.35% <100.00%> (ø) |
| llm_web_kit/model/porn_detector.py | 74.11% <88.88%> (-0.31%) ⬇️ |
| llm_web_kit/model/resource_utils/utils.py | 96.42% <95.83%> (ø) |
| llm_web_kit/model/policical.py | 55.00% <77.77%> (+3.64%) ⬆️ |
| ...lm_web_kit/model/resource_utils/download_assets.py | 95.78% <96.55%> (+4.12%) ⬆️ |
| llm_web_kit/model/resource_utils/unzip_ext.py | 95.74% <93.54%> (+0.09%) ⬆️ |
| llm_web_kit/model/code_detector.py | 51.80% <40.00%> (-1.14%) ⬇️ |

... and 1 more

... and 57 files with indirect coverage changes

@darkrush darkrush merged commit 8653649 into ccprocessor:dev Mar 10, 2025
6 checks passed